The model implements a free-operant reinforcement learning process over fine-grained time bins of width \(\Delta t\) (typically 50 ms). Within each bin the agent may emit at most one response. Actions are denoted \(a \in \{x, y\}\), and outcomes (rewards) can occur in the same or later bins after a stochastic delay.
Eligibility traces use a Dutch formulation with per-bin decay \[\lambda_{\text{bin}} = \exp\left(-\frac{\Delta t}{\tau_e}\right).\] State variables per bin \(b\) include the action values \(Q_b(a)\), Dutch traces \(z_b(a)\), the reward \(r_b\), the running average reward \(\bar{R}_b\), and a one-hot action vector \(\mathbf{x}_b\) capturing the response that occurred in the bin (or zeros if no response).
Temporal-difference errors and updates follow: \[ \delta_b = r_b - \bar{R}_b, \] \[ \mathbf{z}_{b+1} = \lambda_{\text{bin}} \mathbf{z}_b + \left(1 - \alpha \lambda_{\text{bin}} \mathbf{z}_b^\top \mathbf{x}_b\right) \mathbf{x}_b, \] \[ \mathbf{Q}_{b+1} = \mathbf{Q}_b + \alpha \, \delta_b \, \mathbf{z}_b, \] \[ \bar{R}_{b+1} = (1 - \eta)\bar{R}_b + \eta r_b. \] Here \(\alpha\) is the per-bin learning rate, \(\eta\) controls how quickly the average-reward baseline tracks recent outcomes, and \(\tau_e\) sets the eligibility horizon in continuous time.
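To make the update order concrete, here is a minimal Python sketch of one per-bin learning step (the actual simulator is written in R; parameter values here are illustrative placeholders, not the model's defaults):

```python
import numpy as np

def learn_step(Q, z, R_bar, r, x, alpha=0.1, eta=0.01, dt=0.05, tau_e=1.0):
    """One per-bin update of the action values, Dutch traces, and
    average-reward baseline, following the equations above.
    Default values are illustrative placeholders."""
    lam = np.exp(-dt / tau_e)                  # per-bin trace decay
    delta = r - R_bar                          # TD error vs. baseline
    Q_next = Q + alpha * delta * z             # credit all eligible actions
    z_next = lam * z + (1.0 - alpha * lam * (z @ x)) * x  # Dutch trace
    R_bar_next = (1.0 - eta) * R_bar + eta * r
    return Q_next, z_next, R_bar_next

# A bin in which action x was emitted and a reward of 1 arrived:
Q, z, R_bar = learn_step(np.zeros(2), np.zeros(2), 0.0,
                         r=1.0, x=np.array([1.0, 0.0]))
```

Note that \(\mathbf{Q}_{b+1}\) is computed from the trace \(\mathbf{z}_b\) *before* its own update, matching the subscripts in the equations.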
Whether to respond is governed by a vigor function of the time since the last press (\(t_{\text{diff}}\)) and the total action value: \[ \text{logit}(p_{\text{respond}}) = \beta_0 + \beta_t \, t_{\text{diff}} + \beta_q \, (Q_x + Q_y). \] Responding is prohibited during a refractory period \(\phi\), which is an explicit free parameter (refractory_s in the parameter table below) rather than an emergent property of the vigor equation.
Conditional on responding, the agent samples an action from a sticky softmax: \[ m_b(a) = \kappa Q_b(a) + \omega \, \mathbb{I}[a = a_{\text{last}}], \] \[ P(a) = \frac{\exp(m_b(a))}{\sum_j \exp(m_b(j))}. \] The inverse-temperature \(\kappa\) tunes explore/exploit balance, while \(\omega\) biases the agent toward repeating the previous action.
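The two-stage response policy (whether to respond, then which action) can be sketched as follows; the coefficient values are made up for illustration:

```python
import numpy as np

def respond_prob(t_diff, Qx, Qy, beta0=-2.0, beta_t=0.5, beta_q=1.0,
                 refractory_s=0.1):
    """Logit hazard of emitting a response in this bin.
    Coefficient defaults are hypothetical."""
    if t_diff < refractory_s:              # hard refractory gate
        return 0.0
    logit = beta0 + beta_t * t_diff + beta_q * (Qx + Qy)
    return 1.0 / (1.0 + np.exp(-logit))

def choice_probs(Q, a_last, kappa=3.0, omega=1.0):
    """Sticky softmax over actions, biased toward the previous action."""
    m = kappa * np.asarray(Q, dtype=float)
    if a_last is not None:
        m[a_last] += omega                 # stickiness bonus
    m -= m.max()                           # numerical stability
    p = np.exp(m)
    return p / p.sum()
```

Within the refractory window the hazard is exactly zero; outside it, vigor and value jointly raise the response probability, and stickiness only enters at the choice stage.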
In the earlier white paper, I conceived of response probability as an exponential recovery function:
\[ \begin{align} p(\text{respond}_{t,b}) &= 1 - \exp\left[ -(\nu + \gamma Q^*)(\tau - \text{rt}_\text{last}) \right] \\[0.3cm] Q^*_{t,b} &= \sum_{a=1}^{2} Q_{t,a,b} \end{align} \]
In the current code/parameterization, this has switched to a logit hazard model. I need to think more about this change, but here is an initial AI perspective on it:
The current simulator/Stan code borrowed the logistic hazard because it made earlier vigor parameterizations compatible with existing binary-response datasets.
The exponential recovery form is a continuous-time hazard model derived from a renewal process: the probability grows as \((1 - e^{-(\nu+\gamma Q^*)(\tau - \text{rt}_\text{last})})\).
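The two forms can be evaluated side by side; the sketch below uses illustrative values for \(\nu\), \(\gamma\), and the \(\beta\) coefficients. Both hazards increase with time since the last response and with total value, but the exponential-recovery form is pinned to 0 at \(\tau = \text{rt}_\text{last}\), whereas the logistic form has a free intercept:

```python
import numpy as np

def p_exponential(tau, rt_last, Q_star, nu=0.5, gamma=0.2):
    """Exponential recovery: renewal-process hazard whose rate scales
    with total value Q*. nu and gamma are illustrative."""
    return 1.0 - np.exp(-(nu + gamma * Q_star) * (tau - rt_last))

def p_logistic(t_diff, Q_star, beta0=-2.0, beta_t=0.5, beta_q=1.0):
    """Current logit-hazard form over the same inputs."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta_t * t_diff + beta_q * Q_star)))

t = np.linspace(0.1, 5.0, 10)          # time since last response (s)
pe = p_exponential(t, 0.0, Q_star=1.0)
pl = p_logistic(t, Q_star=1.0)
```

One practical difference: the exponential form guarantees a response probability of zero immediately after a press, while the logistic form relies on the separate refractory_s parameter to enforce that.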
Each response can start a reward “job” with action-specific probability \(p_{\text{trigger}}(a)\). Rewards are delivered after an exponentially distributed delay with mean \(\tau_{\text{delay}}(a)\) and magnitude \(R(a)\). Dutch traces distribute credit backward through time, enabling the model to learn despite the delays.
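The reward process can be sketched as a simple job scheduler; p_trigger, delay_mean_s, and reward_amount below are placeholder values, not the simulator's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def maybe_schedule_reward(action, t_now, p_trigger=(0.3, 0.1),
                          delay_mean_s=(1.0, 1.0), reward_amount=(1.0, 1.0)):
    """With action-specific probability, start a reward 'job' that pays out
    after an exponentially distributed delay. Returns (delivery_time,
    magnitude) or None. Parameter defaults are illustrative."""
    if rng.random() < p_trigger[action]:
        delay = rng.exponential(delay_mean_s[action])
        return (t_now + delay, reward_amount[action])
    return None
```

Because the payout lands one or more bins after the triggering response, the Dutch traces above are what carry credit back to that response.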
| Parameter | Interpretation |
|---|---|
| alpha | Learning rate for action values (per bin) |
| eta | Learning rate for the average-reward baseline |
| tau_e | Eligibility time constant (s) |
| beta0_vigor | Baseline logit for responding |
| beta_t_vigor | Effect of time since last response on vigor |
| beta_q_vigor | Effect of total value (Qx + Qy) on vigor |
| refractory_s | Hard refractory period (s) |
| kappa | Inverse temperature for softmax (explore/exploit) |
| omega_sticky | Stickiness weight favoring the previous action |
| p_trigger | Action-specific probability a response schedules a reward |
| delay_mean_s | Mean delay before reward delivery (s) |
| reward_amount | Reward magnitude per action |
The simulator in eligibility_test.R defines a set of
illustrative scenarios that vary key learning and control parameters.
Each scenario is simulated for 4,000 bins (200 s) unless otherwise
noted, and the table below summarizes core behavioral metrics.
| scenario | description | responses_per_s | choice_x | reward_rate | avg_delta |
|---|---|---|---|---|---|
| Baseline symmetric | Balanced environment with default parameters. | 0.600 | 0.258 | 0.280 | 0.001 |
| Pure exploration | Low kappa without stickiness keeps choices balanced under symmetry. | 0.525 | 0.371 | 0.245 | 0.001 |
| Sticky streaks | Low kappa plus strong stickiness produces long streaks on the same action. | 0.605 | 0.157 | 0.280 | 0.001 |
| Deterministic symmetric | High kappa with zero stickiness in a symmetric task locks in after early noise. | 0.640 | 0.992 | 0.310 | 0.001 |
| Asymmetric payoffs | High kappa plus richer X rewards yields strong exploitation. | 0.880 | 0.989 | 0.752 | 0.003 |
| Delayed rewards, short trace | Long reward delays with short eligibility (tau_e) hinder credit assignment. | 0.425 | 0.412 | 0.240 | 0.000 |
| Long eligibility | Longer traces (tau_e) and slower learning spread credit over time. | 0.505 | 0.416 | 0.265 | 0.001 |
| Fast baseline | Large eta lets the average-reward baseline track payoffs quickly. | 0.465 | 0.892 | 0.245 | 0.000 |
| Slow baseline | Tiny eta keeps the baseline sluggish, yielding larger delta swings. | 0.575 | 0.896 | 0.265 | 0.005 |
| High vigor | Elevated vigor parameters yield denser responding despite refractory limits. | 5.845 | 0.010 | 2.950 | 0.004 |
| Hard refractory | Half-second refractory windows create pronounced pauses between responses. | 0.365 | 0.658 | 0.180 | 0.000 |
| Weak vigor | Lower vigor intercept/slope suppresses response initiation. | 0.110 | 0.364 | 0.075 | 0.000 |
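For intuition about how these metrics arise, here is a compact end-to-end Python sketch of the generative process. It uses made-up parameter values rather than the scenario settings, so its numbers will not match the table; the authoritative implementation is eligibility_test.R:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative constants, not the simulator's defaults
dt, tau_e, alpha, eta = 0.05, 1.0, 0.1, 0.01
beta0, beta_t, beta_q, refractory_s = -2.0, 0.5, 1.0, 0.1
kappa, omega = 3.0, 1.0
p_trigger, delay_mean_s, reward_amount = (0.3, 0.3), (1.0, 1.0), (1.0, 1.0)

n_bins = 4000
lam = np.exp(-dt / tau_e)
Q, z, R_bar = np.zeros(2), np.zeros(2), 0.0
a_last, t_last = None, -np.inf
pending = []                      # (delivery_bin, magnitude) reward jobs
n_resp, total_r = 0, 0.0

for b in range(n_bins):
    t = b * dt
    # deliver any reward jobs that have come due
    r = sum(mag for (tb, mag) in pending if tb <= b)
    pending = [(tb, mag) for (tb, mag) in pending if tb > b]
    x = np.zeros(2)
    t_diff = t - t_last
    if t_diff >= refractory_s:    # hard refractory gate
        p = 1.0 / (1.0 + np.exp(-(beta0 + beta_t * t_diff + beta_q * Q.sum())))
        if rng.random() < p:
            m = kappa * Q                          # sticky softmax
            if a_last is not None:
                m[a_last] += omega
            probs = np.exp(m - m.max())
            probs /= probs.sum()
            a = rng.choice(2, p=probs)
            x[a], a_last, t_last = 1.0, a, t
            n_resp += 1
            if rng.random() < p_trigger[a]:        # schedule delayed reward
                delay_bins = int(rng.exponential(delay_mean_s[a]) / dt)
                pending.append((b + delay_bins, reward_amount[a]))
    delta = r - R_bar                              # learning update as above
    Q = Q + alpha * delta * z
    z = lam * z + (1.0 - alpha * lam * (z @ x)) * x
    R_bar = (1.0 - eta) * R_bar + eta * r
    total_r += r

print(f"responses/s = {n_resp / (n_bins * dt):.3f}, "
      f"reward rate = {total_r / (n_bins * dt):.3f}")
```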
The following sections document each scenario in detail, including
the motivation, parameter adjustments relative to the defaults, observed
metrics, and the full set of diagnostic plots produced by
plot_dynamics().